Machine learning models can predict subsequent publication of North American Spine Society (NASS) annual general meeting abstracts

Background context Academic meetings serve as an opportunity to present and discuss novel ideas. Previous studies have identified factors predictive of publication without generating predictive models. Machine learning (ML) presents a novel tool capable of generating these models. As such, the objective of this study was to use ML models to predict subsequent publication of abstracts presented at a major surgical conference. Study design/setting Database study. Methods All abstracts from the North American Spine Society (NASS) annual general meetings (AGM) from 2013–2015 were reviewed. The following information was extracted: number of authors, institution, location, conference category, subject category, study type, data collection methodology, human subject research, and FDA approval. Abstracts were then searched on the PubMed, Google Scholar, and Scopus databases for publication. ML models were trained to predict whether the abstract would be published or not. Quality of models was determined by using the area under the receiver operator curve (AUC). The top ten most important factors were extracted from the most successful model during testing. Results A total of 1119 abstracts were presented, with 553 (49%) abstracts published. During training, the model with the highest AUC and accuracy metrics was the partial least squares (AUC of 0.77±0.05, accuracy of 75.5%±4.7%). During testing, the model with the highest AUC and accuracy was the random forest (AUC of 0.69, accuracy of 67%). The top ten features for the random forest model were (descending order): number of authors, year, conference category, subject category, human subjects research, continent, and data collection methodology. Conclusions This was the first study attempting to use ML to predict the publication of complete articles after abstract presentation at a major academic conference. Future studies should incorporate deep learning frameworks, cognitive/results-based variables and aim to apply this methodology to larger conferences across other fields of medicine to improve the quality of works presented.


Study design/setting
Database study.

Methods
All abstracts from the North American Spine Society (NASS) annual general meetings (AGM) from 2013-2015 were reviewed. The following information was extracted: number of authors, institution, location, conference category, subject category, study type, data collection methodology, human subject research, and FDA approval. Abstracts were then searched on the PubMed, Google Scholar, and Scopus databases for publication. ML models were trained to predict whether the abstract would be published or not. Quality of models was determined by using the area under the receiver operator curve (AUC). The top ten most important factors were extracted from the most successful model during testing.

Results
A total of 1119 abstracts were presented, with 553 (49%) abstracts published. During training, the model with the highest AUC and accuracy metrics was the partial least squares

Introduction
In the scientific community, academic meetings serve as a means of disseminating new research information in a timely manner. In fact, 50-60% of the chapters of major orthopaedic textbooks included results from abstracts presented at international meetings [1]. Despite this fact, peer-reviewed journal articles remain the gold-standard for obtaining new research information [2]. As such, manuscript publication in a peer-reviewed journal is generally the goal for research presented at academic meetings and conferences.
The goal of abstracts eventually being published in peer-reviewed journals has led to a plethora of research dedicated to evaluating conference abstracts, with a major focus on evaluating publication rates of complete articles after presentation at an academic meeting [3][4][5][6]. An example is the 5-year abstract-to-publication ratio that was proposed as a metric to assess the scientific value and quality of academic meetings [7], with few studies addressing factors predictive of manuscript publication after abstract presentation [1,8]. As of now there have been no attempts to predict the chance of publication.
However, machine learning models (MLMs) have recently come to the fore in clinical research because of their predictive abilities, typically being used to predict clinical outcomes, improve system efficiencies and risk stratify patients [9][10][11]. As such, research on publication rates of conference abstracts is a novel opportunity to test MLMs in this emerging domain of academic medicine. Using MLMs, we can improve the quality of research by not only identifying factors predictive of publication as done with standard statistics, but further expand to determine the probability of publication. To date, MLMs have not been used to predict conference abstract publication after presentation at scientific meetings.
The North American Spine Society (NASS) Annual General Meeting (AGM) is an internationally recognized conference on spine surgery. As such, it provides an excellent platform for researchers and clinicians worldwide to showcase the latest advancements in spine surgery. This makes it an ideal conference to use to test the hypothesis of whether artificial intelligence (AI) methods such as MLMs may be used to detect subsequent publication after presentation.
As such, this study is primarily a feasibility study to evaluate the ability of MLMs to predict subsequent publication of abstracts presented at the NASS Annual General Meeting (AGM). Secondary objectives included stratifying the publication rates of these abstracts, as well as determining the predictive factors used by the most accurate model. We hypothesize that MLMs can be used to predict the publication rates of NASS AGM abstracts with acceptable accuracy.

1 Selection of conference abstracts
All abstracts from the NASS AGM from 2013-2015 were reviewed. The 2015 annual general meeting was chosen as the upper limit as previous literature has shown that for most abstracts presented at conferences, subsequent publication most commonly occurs within 5 years of conference date [7]. Only abstracts directed at scientific research were included (i.e., workshops and educational exhibits were not included).

2 Data extraction
The following information was extracted from the abstracts: abstract identification number, title, authors, number of authors, institution of first author, location (city, state/province, country), conference category, subject category, study type, data collection methodology, human subject research, and FDA approved indication.

3 Identification of full-text publication
Abstracts were searched on the PubMed, Google Scholar, and Scopus databases for publication. Search date was inclusive of the deadline to submit an abstract to NASS AGM 2013 until 31st December 2020, which includes up to 1 year before NASS AGM 2013. The abstract title was searched and matched with authors. Only English language abstracts which were published in English or available in English were included.
Abstracts were searched with the first author's last name and first initial to identify articles published with a similar title. If a manuscript was not discovered, the process was repeated for the second author. This would continue until all authors were searched. Abstracts were considered published if the following minimal criteria were met: 1) at least one author from the abstract was included in the manuscript; 2) at least one conclusion from the abstract was identical to the publication. An abstract was considered not published if the above method did not yield a manuscript. In cases where authors published multiple works with similar conclusions, all conclusions from the abstract would need to match the published paper.
In cases where a given author would have published multiple abstracts on the same topic, the methodology and conclusion were compared in detail to find an identical match to the original abstract. In cases where abstracts were presented at multiple categories of a conference, only the first presentation of the abstract was retained.
Subsequent publication was considered a binary outcome (i.e., yes if the abstract is published, no if not). For published abstracts, the following information was extracted: journal of publication, impact factor, date of publication, country of journal, NASS journal or not, and manuscript conclusion same as abstract. Impact factor was retrieved using the Web of Science Master Journal List (https://mjl.clarivate.com/home). If the impact factor was not found in the master list, the latest impact factor from the journal website or Google search was obtained and used. If the journal was no longer active, then the latest available information for it was used. The latest 2-year impact factor available (2018) was included.
Data extraction was performed using the Excel Software Package (Microsoft Corporation, Redmond, Washington). Continuous variables were collated and reported with a mean and standard deviation, while categorical variables were reported with total counts (N) and percentages.

4 Prediction of publication
MLMs were trained to predict whether the abstract would be published (yes prediction) or not published (no prediction). The dataset was randomized and split into a training/testing ratio of 80/20 using stratified sampling to reduce class imbalance, with MLMs trained on the training set and tested on the testing set.
2. 4. 1 Independent variables. Independent variables (features) selected for analysis include number of authors, institution of authors, location, conference category, subject category, study type, data collection methodology, and human subjects research. Refer to Table 1 for a detailed description of these variables. The correlation between features was examined by determining the Pearson correlation coefficient between each feature, with results visualized through a network plot. Features with absolute correlation coefficients > 0.5 were eliminated to avoid overfitting.

4. 2 Missing data.
For each predictor variable, an imputer was trained on the training set to input the missing data. A k-nearest neighbour algorithm was used to replace the missing data, with the weighted mean of the data used from the 10 nearest neighbours. Once trained and applied to the training set, the imputer was applied to the testing set to fill in any missing feature data.

4. 4 Model training and validation.
Repeated k-fold cross validation was used on the training set to hyperparameter tune the models. Specifically, 10-fold cross validation with 5 repeats was used with a grid-search optimization approach to determine the model parameters such that the Receiver Operator Curve (ROC) metrics was maximized. Mean and standard deviation (SD) across of the following metrics were determined for each model across the repeated k-folds for the training dataset: area under the receiver operator curve (AUC), accuracy, sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV).

4. 5 Model testing.
Once trained and tuned, each model was applied to the test dataset. AUC, accuracy, sensitivity, specificity, PPV and NPV for each model was determined. In addition, confusion matrices representing the full distributions of the predictions for each model were determined. The model with the largest AUC was considered the most optimal. Statistical significance of these models during training and testing were determined by comparing the AUC metrics using analysis of variance (ANOVA) with Tukey's test post-hoc sampling. Significance level was set at p < .05. Since there is no current literature defining the acceptable accuracy for MLMs in the context of publication rates, minimally acceptable definition for accuracy was set at 60% as this is slightly higher than a random prediction of 50%.
2. 4. 6 Feature importance. Feature importance was extracted and normalized with respect to the most important feature (%) for the model with the highest AUC on the test set. All statistical analysis was completed using RStudio (RStudio Inc; Version 1.2.5042) [18], running R software (R Foundation, Version 3.6.3) [19].

Results
A total of 1119 abstracts were presented at NASS AGM between 2013-2015, with 553 (49%) of the abstracts published in scientific journals. The training and testing sets consisted of 896 and 223 abstracts, respectively. The median time to publication (years) was 1.25, with the mean time to publication being 1.33 (SD 1.49). The median impact factor of journals was 2.90, with the mean impact factor being 3.34 (SD 4.36). Across published abstracts, the median number of authors was 6, with the mean number of authors being 6.67 (SD 4.26). A breakdown of abstracts according to country, subject category, study type, data collection methodology, human subjects research and FDA approval status is found in Table 2. S1 Table and 2 provide these data for the training and testing set. There were no missing features in the dataset as shown by the aforementioned tables.

1 Publication results
The United States of America (USA) had the highest number of abstracts published (434, 50.4%). In terms of categories, 'Other' had the highest percentage of abstracts published (478, 53.1%), followed by 'Trauma' (50, 5%) and 'Degenerative' (131, 51.9%). Systematic review studies had the highest percentage of publication (37, 56.8%), followed by animal studies (28, 53.6%) and basic science studies (73, 53.4%). Studies with more than 10 authors had the highest percentage of publication (53.4%), followed by 6-10 authors (52.2%) and then less than 5 authors (46.1%). 3. 2. 1 Model results. During training, the models that generated the highest AUC metrics were the random forest and partial least squares models, with an AUC of 0.77 (SD 0.05) ( Table 3). The worst model was the K-nearest neighbours, with an AUC of 0.75 (SD 0.05) and accuracy of 71.3% (SD 2.8). The partial least squares model had the highest accuracy during training of 75.5% (SD 4.7). All models had good AUC scores [20], with all scores being above 0.75. All models had similar sensitivity and specificity except the naive bayes model, with its sensitivity (correctly identifying published abstracts) being well below the others at 0.28 (SD 0.08) and specificity (correctly identifying unpublished abstracts) nearly perfect at 0.94 (SD 0.07).

2 Machine learning predictions
During testing, the models that generated the highest AUC metrics were the random forest, naive bayes, partial least squares, and general boosted linear models AUC of 0.69 (Table 3). The worst MLM was the K-nearest neighbours, with an AUC of 0.67 and accuracy of 66.1%. The random forest model had the highest accuracy of 67.0%. All models had acceptable AUC scores, with AUCs above 0.67 (Table 3). Like training, all models had similar sensitivity and specificity except the naive bayes model, with its sensitivity being much lesser than the others at 0.25 and specificity much higher at 0.93 (Table 3).
ANOVA analysis comparing the AUC metrics of different models during training and testing revealed a significant difference between all models (p < .0001). Post-hoc sampling demonstrated the largest difference between the random forest and naive bayes models, with a difference of -0.113 (95% confidence interval (CI): -0.115, -0.111), p < .0001.
Confusion matrices in Fig 3 provide detailed insight into the efficacy of the models on the test set. Each of these plots may be interpreted like a 2-by-2 epidemiologic table, with true positives (predicting publication when published) and true negatives (predicting not published when not published) on the top left and bottom right corners and false positives (predicting publication when not published) and false negatives (predicting no publication when published) in the top right and bottom left corners [21]. All models seem to perform similarly on the test set, with the random forest performing best and the naive bayes performing the worst. Despite the Naïve Bayes model having the second highest AUC, its distribution of predictive accuracy is distinct from the others (Fig 3).

2. 2 Feature importance.
The top ten features of the random forest model are displayed in Fig 4. According to the model, the features considered most important in descending order were number of authors (increased number increasing chance of publication), year,

Discussion
Our results show that approximately half of the abstracts at the NASS AGM from 2013-2015 were eventually published (~50% 5-year abstract-to-publication ratio). Our results are comparable to the 43.8% and 51% NASS AGM abstract publication rate reported by previously published studies [22,23]. To our knowledge, this was the first study using machine learning approaches to predict the publication of complete articles after abstract presentation at an academic meeting/conference. Our purpose was to show the feasibility of using ML to predict publication, which has

PLOS ONE
been shown with the independent variables chosen. This should provide researchers a framework on which to build ML models that can answer future questions using other variables as deemed relevant for a particular question.
In this study, we tested different machine learning algorithms (random forest, partial least squares, k-nearest neighbours, naive bayes, neural network, general boosted linear model) to determine which of them performs best at predicting publication outcomes. The performance of each algorithm is measured by the metric AUC, with AUC values ranging between 0 and 1. The higher the AUC, the better the model is at predicting abstracts as published versus not published. For reference, an algorithm making random guesses as to which abstracts eventually get published will have an AUC of 0.5. With an AUC ranging from 0.69-0.77 (testing-training), the random forest machine learning algorithm performed 38%-54% better than random guessing. In addition, with accuracy ranging from 67.0%-73.7% (testing-training), the random forest model performed 37%-50.4% better than random guessing (49% accuracy). Despite not being near perfect in terms of AUC and accuracy (AUC >0.9, accuracy >90%), this demonstrates MLMs may generate predictions significantly better than random guessing. Given this is a relatively unexplored field, this improvement in performance is a noteworthy finding.
The top performing MLM (i.e., random forest) placed emphasis on the number of authors, year, conference category and subject category when generating the probability of prediction. A notable finding is that with regards to number of authors, abstracts with more authors were more likely to be published. This may be related to more authors indicating a higher degree of collaboration and thus more motivation for publication, but this is just conjecture. In addition, this insight could provide valuable information to future conference organizers when analysing the quality of their conferences. In addition, it could also help conference organizers improve the quality of research presented via stratifying the abstracts according to probability of publication, ensuring that research of potential greater scientific impact is highlighted as part of the conference's program [24]. A common concern in machine learning literature is the quality and size of the dataset used. In this work, a detailed dataset was used, with 1119 abstracts and numerous features. In the previous studies of NASS abstracts [22,23], the type of presentation (poster vs. podium) and "Best" or "Outstanding" abstract designation were identified as factors that increased the likelihood of publication, which was different from the features deemed important by our study (Fig 4). These predictors of publication, nevertheless, were analysed using linear regression models. As the scientific community continues to incorporate machine learning into various aspects of research, methods of analysing conference abstracts should evolve as well. Machine learning enables us to deviate from the surface level analysis of publication rates/associations and attempt a detailed examination to uncover features that are important and useful for prediction, given its ability to model complex non-linear relationships between various features [22,25,26]. Our study, for example, used machine learning to identify specific predictive features, combining them to create a well performing predictive model, something which cannot be achieved with such granularity using standard statistical methods. Our study may provide some insight into some biases that may exist in acceptance of manuscripts for publication. For example, a major predictor of publication in the top performing model (random forest) was number of authors, which may speak to a biasing of more authors on abstracts increasing their likelihood of publication. This could provide insight into journals adjusting their requirements for acceptance by using these MLMs to create more equitable acceptance criteria. Ultimately, the application of machine learning to this field is an exciting opportunity for continuous improvement, development, and evaluation initiatives of scientific conferences.
This study has some limitations. Firstly, some features included in the algorithm (such as conference categories) vary widely from conference to conference. This means that these models would have limited utility where these features do not exist. However, future work should  [23]). This approach would be challenging to apply to some high impact scientific meetings that are smaller in size by virtue for meeting focus and target audience. The selection process of abstracts for presentation and specialized focus of NASS on interdisciplinary spine advances may differ to the mandate of other scientific medical conferences and the results of this study may not be generalizable to what may be the most important factors predicting publication for other societies. Additional information that may not be available based upon abstracts reviewed such as grant funding and other support may further improve the overall publication prediction. Fourthly, manuscript publication is not necessarily the goal for every abstract presented at research meetings. Many presentations are done to fulfil research requirements of residency or fellowship, which would hamper the ability of MLMs to predict subsequent publication. In addition, submissions to journals varies by topic. As such, certain topics may receive more submissions than others, resulting in acceptance and rejections rates. Furthermore, our study is unable to factor in how the laborious process of preparing a manuscript for publication factors in publication rates. However, the results of this study may provide insight into what manuscripts would be more successful in publication, perhaps streamlining this process. Lastly, deep learning frameworks were not used as these generally require dataset sizes on the order of tens of thousands. In the future, it would be interesting to gather this data over a larger interval e.g., 20 years to evaluate the effect of time on results and models, also allowing one to incorporate deep learning prediction models. Deep learning will allow one to model complex, nonlinear relationships which may arise with time better than more conventional ML models [28].
In conclusion, this study demonstrated that machine learning can be used to predict article publication from abstract presentations at the NASS AGMs. Cognitive variables that are based on the abstract content undoubtedly have significance on publication, which were not included in this study. However, having shown the feasibility of our models as is, further study using these cognitive based variables is indicated. The best model on the test set was the random forest algorithm with the highest AUC and accuracy. Future work should also incorporate a larger database over a longer period and include more features specific to the authors (such as level of training). Overall, incorporation of these MLMs can provide insight into the current biases that exist in publication, improve the quality of scientific meetings, and guide authors into creating the optimal manuscript suitable for publication.
Supporting information S1